Skip to content

Optimize RequiredIndices column collection#19448

Open
feichai0017 wants to merge 11 commits intoapache:mainfrom
feichai0017:improve-required-indices-add-expr
Open

Optimize RequiredIndices column collection#19448
feichai0017 wants to merge 11 commits intoapache:mainfrom
feichai0017:improve-required-indices-add-expr

Conversation

@feichai0017
Copy link
Copy Markdown

Which issue does this PR close?

  • Closes #N/A (optimizer maintenance cleanup; no existing issue).

Rationale for this change

  • Streamline RequiredIndices::add_expr to collect column indices in a single traversal, avoiding temporary HashSet builds and duplicate walks. This reduces small per-query overhead in the projection optimizer and removes unused helpers.

What changes are included in this PR?

  • Refactor add_expr to push column indices during TreeNode::apply traversal, including outer-ref subquery expressions.
  • Add local helpers collect_outer_ref_exprs / push_column_index; remove unused outer_columns and redundant imports.
  • No behavioral change; maintenance/perf-only.

Are these changes tested?

  • Ran full cargo test (pass).

Are there any user-facing changes?

  • No user-facing changes.

Copilot AI review requested due to automatic review settings December 22, 2025 04:21
@github-actions github-actions bot added the optimizer Optimizer rules label Dec 22, 2025
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the RequiredIndices column collection in the projection optimizer by refactoring from a two-pass approach (using column_refs() + outer_columns()) to a single tree traversal. The goal is to reduce per-query overhead by eliminating temporary HashSet allocations and redundant expression walks.

Key changes:

  • Refactored add_expr() to use a single TreeNode::apply() traversal instead of calling column_refs() followed by outer_columns()
  • Added helper functions collect_outer_ref_exprs() and push_column_index() to support the new traversal approach
  • Removed unused outer_columns() and outer_columns_helper_multi() functions from mod.rs along with the unused HashSet import

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
datafusion/optimizer/src/optimize_projections/required_indices.rs Refactored add_expr() to use single-pass traversal with new helper functions; updated imports to include TreeNode
datafusion/optimizer/src/optimize_projections/mod.rs Removed now-unused outer_columns() and outer_columns_helper_multi() functions; removed unused HashSet import

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread datafusion/optimizer/src/optimize_projections/required_indices.rs
@xudong963
Copy link
Copy Markdown
Member

@feichai0017 Do you have a micro benchmark for the change?

@feichai0017
Copy link
Copy Markdown
Author

Not yet, let me run micro benchmark to compare it with old implementation @xudong963

@feichai0017
Copy link
Copy Markdown
Author

Ran the new required_indices micro-bench on an M3 Pro MacBook Pro (36 GB RAM). Results from the latest run:

  • new_wide (200-column projection): ~30 µs (median ~31.3 µs)
  • old_wide: ~42–44 µs (median ~41.2 µs)
  • new_outer_ref: ~26 ns
  • old_outer_ref: ~12–13 ns

Comment thread datafusion/optimizer/src/optimize_projections/required_indices.rs Outdated
Comment thread datafusion/optimizer/src/optimize_projections/required_indices.rs Outdated
Comment thread datafusion/optimizer/benches/required_indices.rs Outdated
Comment thread datafusion/optimizer/benches/required_indices.rs Outdated
Comment thread datafusion/optimizer/benches/required_indices.rs Outdated
Comment thread datafusion/optimizer/benches/required_indices.rs Outdated
@github-actions
Copy link
Copy Markdown

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Feb 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules Stale PR has not had any activity for some time

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants